
    Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

    Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power, and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency, high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments, Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU.
    Comment: To appear in the 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023.
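
    Below is a minimal, runnable Python sketch of the translation path the abstract describes. All names (translate, page_table_walk, PTES_PER_CLUSTER, the cost threshold) and the dictionary-based structures are illustrative assumptions, not Victima's actual hardware design.

```python
# Minimal, hypothetical model of Victima's translation path. All names,
# sizes, and thresholds are illustrative assumptions, not the paper's RTL.

PTES_PER_CLUSTER = 8       # assumed: one 64B L2 block holds a cluster of 8 TLB entries
COSTLY_WALK_CYCLES = 100   # assumed PTW-CP threshold for a "costly" walk

tlb = {}               # last-level TLB: vpn -> pte
l2_tlb_clusters = {}   # repurposed L2 blocks: cluster tag -> {vpn: pte}

def page_table_walk(vpn):
    """Stand-in for a four-level radix-tree walk; returns (pte, cycle cost)."""
    return (f"pte-for-{vpn:#x}", 120)  # pretend every walk costs 120 cycles

def translate(vpn):
    if vpn in tlb:                        # conventional TLB hit
        return tlb[vpn]
    cluster_tag = vpn // PTES_PER_CLUSTER
    cluster = l2_tlb_clusters.get(cluster_tag)
    if cluster is not None and vpn in cluster:
        tlb[vpn] = cluster[vpn]           # hit in a repurposed L2 block:
        return cluster[vpn]               # refill the TLB, no PTW needed
    pte, cost = page_table_walk(vpn)      # miss everywhere: walk the page table
    tlb[vpn] = pte
    if cost >= COSTLY_WALK_CYCLES:        # PTW-CP: only costly-to-translate
        cluster = l2_tlb_clusters.setdefault(cluster_tag, {})
        cluster[vpn] = pte                # addresses earn an L2 cluster slot
    return pte

print(translate(0x1234))  # first access: PTW, then cached in the TLB and L2
print(translate(0x1234))  # second access: plain TLB hit
```

    The key structural point the sketch captures is that the L2-resident clusters sit between the last-level TLB and the page table walker, and that the PTW-CP gates insertion so that cheap-to-translate addresses do not displace useful data blocks.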

    Sectored DRAM: An Energy-Efficient High-Throughput and Practical Fine-Grained DRAM Architecture

    There are two major sources of inefficiency in computing systems that use modern DRAM devices as main memory. First, due to coarse-grained data transfers (the size of a cache block, usually 64B) between the DRAM and the memory controller, systems waste energy on transferring data that is not used. Second, due to coarse-grained DRAM row activation, systems waste energy by activating DRAM cells that are unused in many workloads where spatial locality is lower than the large row size (usually 8-16KB). We propose Sectored DRAM, a new, low-overhead DRAM substrate that alleviates these two inefficiencies by enabling fine-grained DRAM access and activation. To efficiently retrieve only the useful data from DRAM, Sectored DRAM exploits the observation that many cache blocks are not fully utilized in many workloads due to poor spatial locality. Sectored DRAM predicts the words in a cache block that will likely be accessed during the cache block's cache residency and (i) transfers only the predicted words on the memory channel, as opposed to transferring the entire cache block, by dynamically tailoring the DRAM data transfer size to the workload, and (ii) activates a smaller set of cells that contain the predicted words, as opposed to activating the entire DRAM row, by carefully operating physically isolated portions of DRAM rows (MATs). Compared to prior work in fine-grained DRAM, Sectored DRAM greatly reduces DRAM energy consumption, does not reduce DRAM throughput, and can be implemented with low hardware cost. We evaluate Sectored DRAM using 41 workloads from widely-used benchmark suites. Sectored DRAM reduces the DRAM energy consumption of highly-memory-intensive workloads by up to 33% (20% on average) while improving their performance by 17% on average. Sectored DRAM's DRAM energy savings, combined with its system performance improvement, allow system-wide energy savings of up to 23%.
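
    The following is a small, hypothetical Python model of the access flow described above. The history-based predictor, the 8-sector layout, and every function name are assumptions for illustration, not the paper's exact predictor or DRAM interface.

```python
# Minimal, hypothetical model of Sectored DRAM's access flow. The
# history-based predictor and all names here are illustrative assumptions.

WORDS_PER_BLOCK = 8   # a 64B cache block viewed as 8 sectors of 8B words
FULL_MASK = (1 << WORDS_PER_BLOCK) - 1

# Remember which words of a block were touched during its last cache
# residency and predict the same set the next time the block is fetched.
sector_history = {}   # block address -> bitmask of used words

def predict_sectors(block_addr):
    return sector_history.get(block_addr, FULL_MASK)  # cold block: fetch all

def dram_read(block_addr):
    mask = predict_sectors(block_addr)
    sectors = [w for w in range(WORDS_PER_BLOCK) if mask & (1 << w)]
    # Only the MATs backing these sectors are activated, and only these
    # sectors cross the memory channel (instead of the whole 64B block).
    print(f"block {block_addr:#x}: activate + transfer sectors {sectors}")

def on_eviction(block_addr, used_mask):
    sector_history[block_addr] = used_mask  # train with actual word usage

dram_read(0x40)                # cold: all 8 sectors fetched
on_eviction(0x40, 0b00000011)  # only words 0 and 1 were used this residency
dram_read(0x40)                # warm: only sectors [0, 1] fetched/activated
```

    The sketch shows the division of labor the abstract implies: a predictor trained at cache-block eviction time decides the transfer size, and the activation logic maps the predicted sectors onto physically isolated MATs so that both channel energy and activation energy shrink together.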